# Vision-Language Alignment
## InternVL3-14B-Instruct GGUF
unsloth · Apache-2.0 · Image-to-Text · Transformers · 982 downloads · 1 like

InternVL3-14B-Instruct is an advanced multimodal large language model (MLLM) with strong multimodal perception and reasoning capabilities, supporting tasks such as tool use, GUI agents, industrial image analysis, and 3D visual perception.
## Anon
aiden200 · Apache-2.0 · English · 361 downloads · 0 likes

A fine-tuned version of lmms-lab/llava-onevision-qwen2-7b-ov that supports video-text-to-text tasks.
## ViT-BART Image Captioner
SrujanTopalle · Apache-2.0 · Image-to-Text · English · 15 downloads · 1 like

A vision-language model that pairs a ViT image encoder with BART-Large to generate English descriptions of images.
## TITAN
MahmoodLab · Multimodal Fusion · Safetensors · English · 213.39k downloads · 37 likes

TITAN is a multimodal whole-slide foundation model for pathology image analysis, pre-trained with visual self-supervised learning and vision-language alignment.
## LLM2CLIP Llama-3-8B-Instruct CC Finetuned
microsoft · Apache-2.0 · Multimodal Fusion · 18.16k downloads · 35 likes

LLM2CLIP enhances CLIP's cross-modal capabilities with large language models, significantly improving the discriminative power of its visual and text representations.
## IP-Adapter-Instruct
CiaraRowles · Apache-2.0 · Image Generation · English · 103 downloads · 51 likes

IP-Adapter-Instruct is an image-to-image model focused on instruction-guided image editing and generation.
## Cambrian 8B
nyu-visionx · Apache-2.0 · Text-to-Image · Transformers · 565 downloads · 63 likes

Cambrian is an open-source multimodal LLM designed with a vision-centric approach.
## Libra 11B Base
YifanXu · Apache-2.0 · Image-to-Text · Transformers · 18 downloads · 0 likes

Libra is a decoupled vision system built on large language models, with foundational multimodal understanding capabilities.
## CLIP ViT-B/16 CommonPool.L.image S1b B8k
laion · MIT · Text-to-Image · 70 downloads · 0 likes

A vision-language model based on the CLIP architecture, supporting zero-shot image classification.
## Plip
vinid · Text-to-Image · Transformers · 177.58k downloads · 45 likes

CLIP is a multimodal vision-language model that maps images and text into a shared embedding space, enabling zero-shot image classification and cross-modal retrieval.
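The shared-embedding idea behind CLIP-style models can be illustrated without any model weights: once an image and a set of text prompts are embedded into the same unit sphere, zero-shot classification is just picking the prompt with the highest cosine similarity. The vectors below are hypothetical stand-ins for encoder outputs, not output from Plip or any real checkpoint.

```python
import numpy as np

def normalize(v):
    # Project vectors onto the unit sphere so dot products equal cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical 4-dimensional embeddings standing in for encoder outputs;
# a real CLIP-style model would produce these from an image and text prompts.
image_emb = normalize(np.array([0.9, 0.1, 0.0, 0.1]))
text_embs = normalize(np.array([
    [0.8, 0.2, 0.1, 0.0],   # e.g. "a photo of a cat"
    [0.0, 0.1, 0.9, 0.3],   # e.g. "a photo of a dog"
]))
labels = ["cat", "dog"]

# Zero-shot classification: choose the caption whose embedding is closest to the image's.
scores = text_embs @ image_emb
print(labels[int(np.argmax(scores))])  # cat
```

Cross-modal retrieval is the same computation in the other direction: score one text embedding against a bank of image embeddings and rank by similarity.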
## M-BERT Distil 40
M-CLIP · Text-to-Image · Transformers · Multilingual · 46 downloads · 8 likes

Based on distilbert-base-multilingual and fine-tuned so that its sentence embeddings for 40 languages align with the embedding space of the CLIP text encoder.
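The alignment objective used by this kind of model can be sketched in isolation: train a student encoder so its embeddings of non-English sentences land where the CLIP text encoder puts their English translations. The toy below replaces both encoders with random matrices and the fine-tuning loop with a closed-form least-squares map, which minimizes the same MSE objective; all dimensions and data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: student embeddings of 32 non-English sentences (dim 6)
# and teacher CLIP text embeddings of their English translations (dim 4).
student = rng.normal(size=(32, 6))
teacher = rng.normal(size=(32, 4))

# Alignment objective: find a linear map W minimizing ||student @ W - teacher||^2,
# i.e. the MSE loss that the real fine-tuning minimizes over the whole student network.
W, *_ = np.linalg.lstsq(student, teacher, rcond=None)

mse = float(np.mean((student @ W - teacher) ** 2))
print(f"alignment MSE: {mse:.4f}")
```

After alignment, the student's embeddings can be scored directly against CLIP image embeddings, which is what lets a 40-language text encoder drive retrieval over a CLIP image index.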